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Abstract 

In  many  data-intensive  applications,  the  use  of  principal  component  analysis  (PCA)  and  other  related 
techniques  is  ubiquitous  for  dimension  reduction,  data  mining  or  other  transformational  purposes.  Such 
transformations  often  require  efficiently,  reliably  and  accurately  computing  dominant  singular  value  de¬ 
compositions  (SVDs)  of  large  unstructured  matrices.  In  this  paper,  we  propose  and  study  a  subspace 
optimization  technique  to  significantly  accelerate  the  classic  simultaneous  iteration  method.  We  analyze 
the  convergence  of  the  proposed  algorithm,  and  numerically  compare  it  with  several  state-of-the-art  S  VD 
solvers  under  the  MATLAB  environment.  Extensive  computational  results  show  that  on  a  wide  range  of 
large  unstructured  matrices,  the  proposed  algorithm  can  often  provide  improved  efficiency  or  robustness 
over  existing  algorithms. 

Keywords,  subspace  optimization,  dominant  singular  value  decomposition,  Krylov  subspace,  eigen¬ 
value  decomposition 

1  Introduction 

Singular  value  decomposition  (SVD)  is  a  fundamental  and  enormously  useful  tool  in  matrix  computations, 
such  as  determining  the  pseudo-inverse,  the  range  or  null  space,  or  the  rank  of  a  matrix,  solving  regular  or 
total  least  squares  data  fitting  problems,  or  computing  low-rank  approximations  to  a  matrix,  just  to  mention  a 
few.  The  need  for  computing  SVDs  also  frequently  arises  from  diverse  fields  in  statistics,  signal  processing, 
data  mining  or  compression,  and  from  various  dimension-reduction  models  of  large-scale  dynamic  systems. 
Usually,  instead  of  acquiring  all  the  singular  values  and  vectors  of  a  matrix,  it  suffices  to  compute  a  set  of 
dominant  (i.e.,  the  largest)  singular  values  and  their  corresponding  singular  vectors  in  order  to  obtain  the 
most  valuable  and  relevant  information  about  the  underlying  dataset  or  system.  The  purpose  of  this  paper 
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is  to  present  an  algorithm  for  efficiently,  reliably  and  accurately  computing  dominant  SVDs  of  large  and 
unstructured  matrices. 

To  fix  notation,  let  us  consider  a  real  matrix  A  6  Wnxn  (m  <  n  without  loss  of  generality)  and  a  given 
positive  integer  k  <C  min(m,  n).  The  task  is  to  find  two  orthogonal  matrices  Uk  =  uk\  G  Wnxk 

and  Vk  =  [v i , . . . ,  Vk]  G  Wixk,  and  a  diagonal  matrix  V/,.  G  M.kxk  whose  diagonal  entries  are  the  k  largest 
singular  values  of  A,  say,  a\  >  •  •  •  >  ak  >  0,  such  that 

k 

Ak  =  V  cr iUivj  =  argmin  \\A  -  W ||p,  (1) 

:  |  rank(ty)<fc 


where  ||  •  |p  denotes  the  Frobenius  norm  for  matrices.  For  convenience,  we  will  refer  to  such  an  approxima¬ 
tion  .4  ~  Ak  =  Uk Yk  Vk  as  the  k- th  dominant  (or  truncated)  singular  value  decomposition  of  A,  or  simply 
the  dominant  SVD  without  explicitly  mentioning  the  number  k,  which  is  typically  much  smaller  than  rn. 

During  decades  of  research,  numerous  iterative  algorithms  have  been  developed  for  computing  dominant 
SVDs  of  A,  mostly  based  on  computing  eigen-pairs  (i.e.,  eigenvalues  and  eigenvectors)  of  the  symmetric 
matrix  B  =  AAT  (or  AA A),  or  alternatively  those  of 


B  = 


widi  or  without  additional  shifts.  The  classic  simple  subspace,  or  simultaneous,  iteration  method,  extends 
the  idea  of  the  power  method  which  computes  the  largest  eigenvalue  and  its  eigenvector  (see  [16,  17,  24,  26] 
for  example),  performing  repeated  matrix  multiplication  followed  by  orthogonalization.  More  elaborative 
algorithms  including  Arnoldi  methods  [14, 13],  Lanczos  methods  [21, 12],  Jacobi-Davidson  methods  [23, 2], 
to  cite  a  few  for  each  type,  are  specifically  designed  for  large-scale  but  sparse  matrices  or  other  types  of 
structured  matrices. 

Traditionally,  research  on  large-scale  eigenvalue  (or  singular  value)  problems  has  primarily  focused  on 
computing  a  subset  of  eigenvalues  and  eigenvectors  (or  singular  values  and  vectors)  of  large  sparse  matrices. 
Special  algorithms  for  computing  dominant  SVDs  of  large  unstructured,  hence  dense,  matrices  seem  few 
and  far  between.  This  state  of  affair  is  evident  when  examining  survey  articles  and  reference  books  on 
eigenvalue  and/or  singular  value  computations  (e.g.,  [19,  7,  25,  22,  10]).  On  the  other  hand,  large-scale 
but  unstructured  matrices  do  arise  from  applications,  such  as  image  processing  and  computational  materials 
science  (e.g.,  [27,  20]),  which  involve  large  amounts  of  data  even  though  their  dimensions  are  relatively 
small  compared  to  those  of  large  sparse  matrices.  This  is  especially  true  in  recent  years  when  the  demand 
for  computing  dominant  SVDs  of  large  unstructured  matrices  has  become  increasingly  pervasive  in  data- 
intensive  applications.  Clearly,  for  any  large  enough  dense  matrix  A ,  the  approach  of  computing  the  full 
SVD  of  A  then  truncating  to  the  k-th  dominant  SVD  is  not  a  practical  option  when  k  is  considerably  smaller 
than  the  sizes  of  A. 

In  some  applications,  it  suffices  to  have  rough  approximations  to  data  matrices  using  not-so-accurate 
dominant  SVDs.  One  approach  for  computing  approximate  SVDs  is  to  use  the  Monto  Carlo  method  [6] 
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which  approximates  the  dominant  singular  values  and  corresponding  singular  vectors  by  extracting  a  subma¬ 
trix  from  the  target  matrix.  Recently,  a  randomized  method  for  approximating  dominant  SVDs  is  analyzed 
in  [9]  where  one  applies  the  simultaneous  iteration  once  or  a  few  times,  starting  from  a  randomly  generated 
initial  guess.  In  these  two  works,  theoretical  results  arc  developed  to  estimate  approximation  quality. 

In  the  present  paper,  we  aim  to  develop  a  general-purpose  algorithm  for  computing  dominant  SVDs  of 
large  unstructured  matrices,  to  a  high  precision  if  so  desired.  Our  approach  is  using  a  block  Krylov  subspace 
optimization  technique,  to  be  explained  below,  to  accelerate  the  basic  simultaneous  iteration  method  for 
computing  the  k  largest  eigenvalues  of  AAT  (or  ATA)  and  their  corresponding  eigenvectors  that  we  will 
also  refer  to  as  the  leading  eigenvectors. 

It  is  well  known  that  the  k  leading  eigenvectors  of  ,4.4T  maximize  the  following  Rayleigh-Ritz  function 
under  orthogonality  constraint: 


max  ||ATV||p,  s.t.  XTX  =  I.  (2) 

xeRmXk 

In  our  proposed  approach,  we  utilize  (2)  and  a  subspace  optimization  technique  to  accelerate  the  classic 
simple  subspace  iterations.  Specifically,  in  addition  to  the  usual  subspace  iteration  in  which  the  current 
iterate  of  X  is  multiplied  by  ,4,4 T,  we  also  solve  a  restricted  version  of  (2)  in  a  subspace  spanned  by  a  few 
previous  iterates  of  X.  For  example,  at  iteration  i,  we  may  add  an  extra  subspace  constraint  to  (2)  so  that 

X  e  span  ,  zL4TX^_1)  j 

where  the  above  inclusion  means  that  all  the  k  columns  of  X  are  linear  combinations  of  the  2k  columns 
on  the  right-hand  side.  In  our  approach,  the  number  of  previous  iterates,  or  the  length  of  the  memory,  can 
be  adjusted  from  iteration  to  iteration.  Recall  that  a  Krylov  subspace  generated  by  a  matrix  B  and  a  vector 
x  is  defined  as  the  span  of  {x,  I  lx.  B2x,  •  •  •  ,  B'1  ]x\  for  some  positive  integer  d.  Therefore,  we  will  call 
our  approach  a  limited  memory  block  Krylov  subspace  optimization  technique.  A  key  observation  is  that, 
like  the  classic  simple  subspace  iteration  method,  the  proposed  algorithm  only  requires  one  matrix-block 
multiplication  of  the  form  AAV  X  per  iteration;  that  is,  the  subspace  optimization  step  requires  no  extra 
matrix-block  multiplication  of  this  kind. 

1.1  Contributions 

The  classic  simple  subspace  (or  simultaneous)  iteration  (SSI)  method  has  two  main  advantages:  (i)  the  use 
of  simultaneous  matrix-block  multiplications  instead  of  individual  matrix-vector  multiplications  enables 
fast  memory  access  and  parallelizable  computation  on  modern  computer  architectures,  (ii)  when  comput¬ 
ing  a  sequence  of  closely  related  eigen-pairs,  stalling  from  a  block  matrix  facilitates  the  reuse  of  available 
eigenspace  information  in  the  form  of  warm-start  strategies.  Flowever,  a  severe  shortcoming  of  the  SSI 
method  is  that  its  convergence  speed  depends  greatly  on  eigenvalue  distributions  that  becomes  intolerably 
slow  with  unfavorable  eigenvalue  distributions.  This  shortcoming  basically  prevents  the  SSI  method  from 
being  used  as  a  general-purpose  solver.  A  few  SSI  acceleration  techniques  exist,  such  as  the  Chebyshev  ac¬ 
celeration  (e.g.,  [18,  25]),  but  their  effectiveness  and  applicability  remain  limited.  Presently,  general-purpose 
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solvers  for  large-scale  eigenvalue/singular  value  problems  are  largely  designed  for  sparse  or  otherwise  struc¬ 
tured  matrices. 

In  this  work,  we  propose  and  study  an  SSI-based  but  much  accelerated  algorithm  for  computing  dom¬ 
inant  SVDs  of  large  unstructured  matrices.  At  each  iteration,  the  algorithm  maximizes  the  Rayleigh-Ritz 
function  in  a  subspace  spanned  by  a  few  previous  iterate  blocks  with  an  adjustable  dimension.  In  doing  so,  it 
does  not  require  any  additional  matrix-block  multiplication.  The  algorithm  is  of  a  general-purpose  type  ca¬ 
pable  of  obtaining  highly  accurate  solutions,  as  opposed  to  only  rough  approximations.  Extensive  numerical 
experiments  have  been  conducted  to  verify  the  effectiveness  of  the  algorithm  with  different  singular  value 
distributions.  Computational  evidence  shows  that,  under  the  Matlab  environment,  a  simple  implementation 
of  the  proposed  algorithm,  called  LMSVD,  already  outperforms  some  state-of-the-art  SVD  solvers  on  a 
wide  range  of  test  matrices  within  the  target  class. 

The  idea  of  subspace  optimization  has  often  been  employed  in  optimization  such  as  in  nonlinear  pro¬ 
gramming  (see,  for  example,  [8,  4,  30]).  A  more  closely  related  work  is  “Locally  Optimal  Block  Pre¬ 
conditioned  Conjugate  Gradient  Method”  (LOBPCG)  proposed  by  Knyazev  [11]  for  solving  generalized 
eigenvalue  problems.  The  subspace  in  each  iteration  of  LOBPCG  consists  of  the  current  iterate  Xt:'\  the 
previous  iterate  X'l~' 1  and  a  residual  vectors  (possibly  preconditioned).  Our  choice  of  subspace  is  more 
general  and  its  dimension  is  adjustable.  Moreover,  the  subproblems  we  choose  to  solve  are  small  eigenvalue 
problems,  while  LOBPCG  solves  small  generalized  eigenvalue  subproblems.  Since  both  approaches  use 
subspace  optimization,  in  this  paper,  we  will  present  numerical  comparison  results  with  LOBPCG.  Another 
work  by  Yang,  Meza  and  Wang  [28]  applies  a  special-purpose,  three-block  subspace  optimization  technique 
to  a  nonlinear  eigenvalue  problems  in  electronic  structure  computation.  Computational  results  are  reported 
in  both  [11]  and  [28],  but  few  theoretical  results  are  given. 

This  paper  also  includes  a  convergence  analysis  for  our  proposed  acceleration  method  for  SSI.  Due  to 
the  introduction  of  intermediate  variables  and  additional  structural  changes,  the  convergence  analysis  for 
the  original  SSI  method  no  longer  applies  to  our  approach.  Using  a  different  analysis,  we  are  able  to  obtain 
global  convergence  results  under  reasonable  assumptions. 

1.2  Organization 

The  rest  of  this  paper  is  organized  as  follows.  We  describe  the  motivation  of  our  approach,  the  algorithmic 
framework  and  the  detailed  algorithm  in  section  2;  our  convergence  analysis  is  presented  in  section  3; 
and  several  implementation  issues  are  discussed  in  section  4.  Extensive  numerical  results  are  presented  in 
section  5  to  evaluate  the  robustness  and  efficiency  of  the  proposed  algorithm.  Linally,  we  conclude  the  paper 
in  section  6. 

1.3  Notations 

The  trace  of  a  matrix  M  <G  Mnxn  is  denoted  by  trace(M),  and  the  main  diagonal  of  M  is  denoted  by 
diag(M).  Given  a  symmetric  matrix  M  e  Mnxn,  A(M)  is  a  diagonal  matrix  whose  diagonal  elements 
are  the  eigenvalues  M  in  descending  order.  Superscripts  for  matrices  are  iteration  counters  if  they  are  in 
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parentheses;  otherwise  they  arc  exponents.  Further  notations  will  be  explained  as  they  appeal-  during  the 
progress  of  the  paper. 

2  Limited  Memory  Block  Krylov  Subspace  Optimization 

Let  us  reiterate  that  our  goal  is  to  compute  the  k-th  dominant  SVD  of  a  matrix  A  €  Mmxn  as  is  defined  in 
problem  (1),  and  our  method  is  based  on  accelerating  the  simple  subspace  (or  simultaneous)  iteration  (SSI) 
method  via  solving  problem  (2)  in  a  chosen  subspace  at  each  iteration. 

2.1  Motivation  and  Framework 

Starting  from  an  initial  point  X^  £  Mmxfc,  SSI  computes  the  next  iterate  A'?,+  l  j  by  the  formula 

X(i+V  e  orth  (AATX^  ,  (3) 

where  orth(M)  denotes  the  set  of  orthonormal  bases  for  the  range  space  of  M .  As  such,  the  iterates  of  SSI, 
with  a  possible  exception  for  the  initial  guess,  satisfy  the  orthogonality  condition  XTX  =  I.  When  k  =  1, 
the  SSI  method  reduces  to  the  well-known  power  method  for  computing  the  single  largest  eigenvalue  and 
its  eigenvector.  In  the  SSI  method,  the  orthonormalization  step  is  indispensable  (for  example,  see  [19]  for 
more  details). 

For  an  unstructured  matrix  A,  the  computational  costs  of  the  matrix-block  multiplication  (i.e.,  AATX) 
and  orthonormalization  in  (3)  are  0(nmk )  and  0(mk2),  respectively.  In  most  applications,  the  approxi¬ 
mating  rank  k  is  far  less  than  the  dimension  m.  Hence,  the  matrix-block  multiplications  of  the  type  AATX 
constitute  the  dominate  computational  cost  of  SSI.  Obviously,  an  acceleration  will  be  achieved  if  one  can 
reduce  the  number  of  iterations  without  having  to  incur  extra  matrix-block  multiplications  or  other  signif¬ 
icant  overhead.  To  achieve  the  goal  of  reducing  the  number  of  iterations,  we  propose  to  modify  the  basic 
SSI  framework  as  follows.  We  replace  the  last  iterate  jW  in  the  right-hand  side  of  (3)  by  an  “improved” 
intermediate  iterate  jW  so  that 

X^  £  orth  ,  (4) 

where,  for  a  chosen  subspace  with  a  block  Krylov  subspace  structure, 

X(i)  :=  argmax  ||ATA||^  s.t.  XTX  =  I,  X  €  S(i).  (5) 

xeRmxk 

Again,  X  £  Sil>  means  all  columns  of  X  are  from  the  subspace  S11'1 .  We  will  soon  specify  the  definition 
of  <SW  and  show  that  the  cost  of  solving  (5)  can  be  kept  relatively  low  when  the  dimension  of  the  subspace 
<S^  is  relatively  small. 

We  now  give  the  framework  of  our  proposed  algorithm  that  will  be  called  LMSVD-F  for  ease  of  refer¬ 
ence.  Further  details  about  this  framework  will  be  specified  in  the  next  subsection. 
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Algorithm  1:  Framework  LMSVD-F 

1  Input  A  and  k,  initialize  G  Mmxfc  and  set  i  =  0. 

2  while  “not  converged”  do 

3  Specify  the  block  Krylov  subspace  S^; 

/*  Block  Subspace  Optimization  */ 

4  Solve  X®  :=  argmax{||ATX||2  :  XTX  =  I,  X  €  S®}; 

/*  Simultaneous  Iteration  */ 

5  Compute  G  or  th  (AATX^); 

6  Increment  i  and  continue. 

2.2  Algorithm  Details 

We  now  describe  the  selection  of  the  subspace  <SW  which  is  constructed  from  a  limited  memory  of  the  last 
a  few  iterates.  Its  choice  is  of  course  not  unique.  We  first  consider  the  subspace  spanned  by  the  current  i-th 
iterate  and  the  previous  p  iterates;  i.e., 

S(i)  :=  span  ,  (6) 

where  the  memory  length  p  >  0  will  be  specified  in  Section  4.1. 

We  collect  the  current  and  the  other  p  saved  iterate  blocks  in  (6)  into  a  matrix 

X  =  xf)  :=  X(i),X(i"1),-,^(i'p)  erx?  (7) 

where  q  =  {p  +  l)k  is  the  total  number  of  columns  in  xj/c  For  notational  simplicity,  from  here  on  we  often 

(i) 

choose  to  drop  the  superscript  and  subscript  from  quantities  like  Xp  whenever  no  confusion  would  arise. 
Also  note  that  the  collection  matrix  X  is  boldfaced  to  make  it  distinct  from  its  blocks.  Similarly,  we  define 

Y  =  YW  :=  ATX«  :=  \atX^,AtX^~1\  ...,AtX^  1  G  Rmxq,  (8) 

which  is  also  saved  in  memory.  We  emphasize  that  SSI  already  computes  these  blocks  in  Y  and  we  only 
need  to  save  them  once  computed. 

It  is  clearly  that  X  G  <SW  if  and  only  if  X  =  XV  for  some  V  G  Wixk\  and  the  subspace  optimization 
problem  (5)  is  equivalent  to  a  generalized  eigenvalue  decomposition  problem: 

max  ||(ATX)l/||jk  s.t.  VT  (XTX)  V  =  I.  (9) 

ygjjqxfc 

However,  numerical  difficulty  may  arise  in  solving  (9)  as  the  matrix  XTX  can  become  numerically  rank 
deficient.  A  more  stable  approach,  which  we  will  implement,  is  to  find  an  orthonormal  basis  for  S^\  say, 

Q  =  QW  g  orth  (XW)  , 
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and  to  express  a  matrix  X  £  Sh:')  as  X  =  Q V  for  some  V  £  M9Xfc.  Here  we  assume  that  X  has  a  full 
rank  and  will  later  relax  this  assumption.  We  now  convert  the  generalized  eigenvalue  problem  (9)  into  an 
equivalent  eigenvalue  problem 

max  ||RV||i,  s.t.  VTV  =  /,  (10) 

VeR«xfe 

where 

R  =  R«  :=ATQW.  (11) 

Next  we  describe  how  to  calculate  the  matrix  product  R  in  (11)  from  historical  information  without  any 
additional  computation  involving  the  matrix  A.  Since  Q  £  ortli(X)  and  we  assume  that  X  has  a  full  rank, 
there  exists  a  nonsingular  matrix  C  £  Wixq  such  that  X  =  Q(7.  Hence,  Q  =  X(7_1,  and  R  in  (11)  can  be 
computed  as 

R  =  HtQ  =  =  YCT1,  (12) 

where  Y  =  A  1 X  is  accessible  from  our  limited  memory.  Once  R  is  available,  we  can  solve  (10)  by 
computing  the  k  leading  eigenvectors  of  the  q  x  q  matrix  RTR.  Let  a  solution  to  (10)  be  V.  The  matrix 
product  in  Line  5  of  the  framework  LMSVD-F  can  then  be  assembled  as 

AATX{i)  =  ARP  =  AYC~lV .  (13) 

The  remaining  issue  is  how  to  efficiently  and  stably  compute  Q  and  R  even  when  the  matrix  X  is  numeri¬ 
cally  rank  deficient.  We  use  the  following  procedure  in  our  implementation. 

Noting  that  each  block  in  X  is  individually  orthonormal,  we  choose  to  keep  the  latest  block  jW  intact, 
and  project  the  rest  of  the  blocks  onto  the  null  space  of  (2fW)T,  obtaining 

Px  =  pg  :=  •  •  •  X^  .  (14) 

Next  we  perform  an  orthonormalization  of  P  y  via  the  eigenvalue  decomposition  of  its  Gram  matrix 

P'vPv  =  UxAxUl,  (15) 

where  Ux  is  orthogonal  and  A  y  is  diagonal.  It  can  be  easily  verified  that  the  matrix 

Q  =  QW  :=  X«  PxUxA-x*  e  orth  (X?)  >  (16) 

provided  that  A  y  is  invertible.  The  above  procedure  can  be  stabilized  by  monitoring  the  numerical  rank  of 
Pjy ,  or  specifically  the  eigenvalues  on  the  diagonal  of  the  matrix  A  y  in  (15).  We  choose  to  implement  the 
following  two-step  stabilization  scheme : 

Step  1  Delete  the  columns  of  P  y  whose  Euclidean  norms  arc  below  a  threshold  e\  >  0. 

Step  2  Delete  the  eigenvalues  in  Ax,  and  corresponding  columns  in  Ux,  that  are  less  than  €2  >  0. 
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With  a  slight  abuse  of  notation,  we  will  continue  to  use  Pj,  Ux  and  Ax  to  denote  their  stabilized  versions, 
respectively,  after  possible  deletions.  Therefore,  a  stable  construction  of  Q  is  still  given  by  formula  (16). 
After  this  stable  orthonormalization,  the  corresponding  R  matrix  can  be  generated  as 

r  =  r«:=  y«  pYUxAx*  >  (17) 

where  before  the  stabilization  procedure  we  had 

Py  =  pW  :=  . . .  y(*-p)l  _  yW(x«)T  •  •  •  X^~p)]  ,  (18) 

but  some  of  the  columns  of  Py  may  have  been  deleted  coiTesponding  to  those  deleted  columns  of  P  \ 
during  the  stabilization  steps.  After  the  removal  of  numerical  rank  deficiency,  the  R  matrix  in  (17)  is  well 
defined  as  is  the  Q  matrix  in  (16). 

In  summary,  the  algorithm  performs  eigenvalue  decompositions  on  two  small  symmetric  positive  definite 
matrices:  PTXPX  in  (15)  and  RTR  in  (10).  The  sizes  of  the  two  matrices  are  pk  and  (p  +  1  )k,  respectively, 
and  frequently  smaller  due  to  deletions.  Our  computational  experience  indicates  that  in  general  p  should  be 
set  to  2  or  3,  or  at  most  4  but  not  greater.  Consequently,  when  k  is  sufficiently  smaller  than  m,  it  holds  that 
(p  +  1 ) /;:  <C  m  <  n.  In  such  a  case,  any  adequate  eigenvalue  solvers  for  dense  matrices  can  be  utilized 
to  solve  the  two  small  eigenvalue  problems  without  significantly  affecting  the  overall  performance  of  our 
algorithm. 

2.3  LMSVD  Algorithm  Statement 

Based  on  the  description  above,  we  state  our  full  Algorithm  with  the  exception  about  our  strategy  for  select¬ 
ing  the  memory  length  (or  block  size)  p  that  is  to  be  specified  later.  For  ease  of  reference,  the  algorithm  will 
be  referred  to  as  LMSVD. 

Algorithm  2:  Limited  Memory  Subspace  Optimization  for  SVD  (LMSVD) 

1  Input  A  and  k.  Initialize  X  =  X(°)  £  Wnxk,  Y  =  y(°)  =  ATX<'°\  andp  =  i  =  0; 

2  while  “not  converged”  do 

/*  Block  Subspace  Optimization  */ 

3  Compute  P  \  by  (14)  and  perform  stabilization  Step  1; 

4  Compute  Py  by  (18)  with  the  same  column  deletions  as  for  P x; 

5  Compute  the  eigenvalue  decomposition  of  P^P.v  in  (15); 

6  Perform  stabilization  Step  2  to  possibly  shrink  Y,x  and  Ux 

7  Compute  R  by  (17)  and  eigenvalue  decomposition  of  RTR; 

8  Let  V  solve  (10),  consisting  of  the  k  leading  eigenvectors  of  RTR; 

9  Compute  X W  =  QV  and  Y' W  =  Ry  (which  equals  ATXW); 

/*  Simultaneous  Iteration  */ 

10  Compute  X^  £  orth(AyW)  and  y(*+1)  =  ATX^+1^; 

11  Increment  i,  update  p,  X  and  Y,  and  continue. 
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In  Line  10,  the  orthonormalization  is  done  by  QR  factorization  in  our  implementation.  It  is  worth  noting 
that  the  solution  jW  to  problem  (10)  has  orthonormal  columns. 

It  is  also  reasonable  to  use  the  intermediate  iterates  X-3K  j  =  i  —  1,  •  •  ■  ,  i  —  p,  to  construct  a  limited 
memory  block  Krylov  subspace 

5(i)  :=  ,  (19) 

to  be  used  in  place  of  in  the  LMSVD  framework.  In  our  experiments,  this  alternative  subspace  con¬ 
struction  works  equally  well,  very  often  slightly  better.  Hence,  we  choose  to  use  the  above  subspace  as 
the  default  construction  in  our  LMSVD  implementation. 

2.4  Complexity  per  Iteration 

On  unstructured  matrices,  the  per-iteration  operation  count  of  the  classic  SSI  method  is  O(nrnk)  for  the 
matrix-block  multiplications,  and  0(mk 2)  for  the  orthonormalization  (say,  by  QR  factorization).  On  the 
other  hand,  the  additional  cost  per-iteration  of  LMSVD  is  chiefly  0(mk2(l+p)2)  for  finding  an  orthonormal 
basis  for  the  subspace  S^\  and  O ( 1C ( 1  +  p)3)  for  solving  the  two  small  eigenvalue  problems.  Comparing 
the  total  additional  cost  of  LMSVD  to  the  total  cost  of  SSL  i.e., 

O  ((m  +  k  +  kp)k2(l  +  p)2)  vs  0(mk(n  +  k)), 

we  can  see  that  when  p  <C  k  <C  m  <  n,  the  above  reduces  to  0{k(l  +  p)2)  vs  0(n  +  k ).  In  this  case, 
the  former  is  clearly  dominated  by  the  latter,  and  a  considerable  acceleration  to  the  SSI  method  becomes 
achievable  if  our  subspace  optimization  strategy  can  significantly  reduce  the  number  of  iterations  required. 

The  argument  above,  however,  hinges  on  the  fact  that  the  matrices  under  consideration  are  unstructured 
and  dense.  For  highly  structured  matrices  such  as  sparse  matrices,  the  cost  of  matrix-vector  multiplications 
involving  A  may  become  much  lower  than  0{nm).  In  this  case,  it  is  likely  that  the  overhead  introduced  by 
our  subspace  optimization  becomes  too  high  to  offer  any  benefits. 

3  Convergence  Analysis 

In  this  section,  we  study  convergence  properties  of  the  proposed  algorithm  LMSVD.  In  the  first  subsection, 
we  describe  the  first-order  necessary  optimality  conditions  of  (2)  and  restate  the  existing  convergence  results 
for  SSL  Then  we  give  a  convergence  result  for  a  class  of  SSI-based  algorithms,  including  both  SSI  and 
LMSVD.  The  convergence  is  generally  to  a  critical  point  of  (2);  but  under  mild  assumptions,  it  also  implies 
convergence  to  the  global  optimal  solution  of  (2). 

3.1  Preliminaries 

For  notational  simplicity,  let 

B  =  2L4t, 
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and  Ai  >  ...  >  Am  >  0  be  the  eigenvalues  of  B.  Since  the  matrix  XTX  is  symmetric,  the  Lagrangian 
multiplier  A  corresponding  to  XTX  =  I  is  a  symmetric  matrix  as  well.  The  Lagrangian  function  of  problem 
(2)  is 

£(X,  A)  =  r  (trac e(XT BX)  —  trace(A(XTX  —  /)))  . 


Due  the  orthogonality  constraint  XT X  =  I ,  the  linear  independence  constraint  qualification  always  holds 
true  for  (2).  Therefore,  the  first-order  necessary  conditions  for  optimality  of  (2)  are  that 


dx£(X,  A)  =  BX  -X A  =  0  and  XTX  =  I. 


(20) 


Multiplying  both  sides  of  the  first  equation  in  (20)  by  XT  and  using  X  X  =  I,  we  obtain  the  expression 

A  =  XTBX.  (21) 


Combining  (20)  and  (21),  plus  the  orthogonality  constraint,  we  can  write  the  first-order  necessary  optimality 
conditions  of  (2)  into 


(I  -  XXT)BX  =  0  and  XTX  =  I.  (22) 

It  is  clear  that  (i)  if  A'  is  a  critical  point,  so  is  XQ  for  any  orthogonal  matrix  Q;  and  (ii)  the  columns  of  any 
critical  point  X  span  a  nontrivial  invariant  subspace  of  B  (but  need  not  to  be  eigenvectors  of  B). 

To  facilitate  further  discussions,  we  denote  the  distance  between  a  matrix  X  £  Wmxk  and  a  set  £  C 
{X  \  X  £  Mmxfc}  as  dist(X,  £),  i.e., 

dist(X,  C)  :=  min  I IX  —  Y lip-  (23) 

YeC 

We  use  C*  and  IX  to  denote,  respectively,  the  sets  of  optimal  solutions  and  critical  points  of  (2);  that  is, 

C*  :=  {X  €  Mmxfc  |  X  is  an  optimal  solution  of  (2)},  (24) 

Cc  :=  {X  €  Mmxfc  |  X  satisfies  the  conditions  (22)}.  (25) 

3.2  Convergence  of  SSI 

Let  vi,v2,  :.,vm  be  the  unit  eigenvectors  of  B  associated  with  Ai  >  ...  >  \m  >  0,  respectively,  and 
in  particular  \%  :=  [u i ,  V2,  ■■■,  Vk\  £  Wnxk  consisting  of  the  k  leading  eigenvectors.  We  first  state  two 
standard  assumptions  for  SSL 

Assumption  3.1.  (a)  The  initial  iterate  X^0^  satisfies  the  condition  that  V^XX  is  nonsingular,  (b)  There 
is  a  gap  between  the  k-th  and  the  ( k  +  1  )-th  eigenvalues  of  B,  namely,  A&  >  Afc+i  >  0. 

SSI  has  the  following  well-known  convergence  property. 
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Proposition  3.1.  Let  {  Xll  't }  be  generated  by  SSI.  Under  Assumption  3.1,  the  distance  between  i-th  iterate 
and  the  optimal  solution  set  C*  of  (2)  converges  to  zero,  /.<?., 

lim  dist(X^,£*)  =  0, 

i— >-+oo 

Moreover,  an  upper  bound  for  the  asymptotic  convergence  rate  is  given  by 

r  dist(X(*+1),£*)  ^  Xk+i  ^ 

>+oo  dist(2C W,  C*)  A k 

We  should  point  out  that  convergence  properties  of  SSI  arc  normally  stated  in  terms  of  Ritz-value  and 
Ritz-vector  sequences  (see  [19]  and  [25],  for  example)  that  requires  further  calculations  in  addition  to  the 
main  SSI  operations  given  in  (4).  However,  the  statement  in  Proposition  3.1,  which  better  suits  our  purpose 
in  this  paper,  is  by  no  mean  weaker  than  other  more  “standard”  forms  of  convergence  results  for  SSI. 

3.3  Convergence  of  an  accelerated  class 

The  introduction  of  intermediate  iterates  jW  also  introduces  complications  in  the  analysis  of  the  acceler¬ 
ated  subspace  iterations.  Unable  to  directly  extend  the  convergence  proof  of  SSI  to  LMSVD,  we  take  a 
different  approach  and  establish  a  more  general  result  that  requires  no  assumption  whatsoever.  Naturally, 
the  conclusion  is  not  as  strong  as  the  existing  SSI  result  in  Proposition  3.1.  The  result  is  Theorem  3.1  and 
is  applicable  to  a  whole  class  of  SSI-based  algorithms  that  we  call  the  SSI+  class  defined  below,  which 
includes  both  SSI  and  LMSVD  as  special  cases.  Recall  that  <J>(X)  =  ||AT9r  ||p. 


Algorithm  3:  SSI+  Class 

t  Input  matrix  A  e  Kmxn  (m  <  n)  and  positive  integer  k  <  m. 

2  Initialize  G  Rmxk  so  that  (X^)TX^  =  I  and  ATX^  f  0. 

3  for  i  =  0, 1,  2,  •  •  •  do 

4  Find  <5  Rmxk  such  that  (xW)TlW  =  I  and  $(1®)  >  $(jW). 

5  Let  2sA*+1)  £  orth(2L4TXW),  and  update  k  =  rank(2f^i+1^)  if  necessary. 


Clearly,  if  we  set  jW  =  jW  at  every  iteration,  we  recover  SSI. 

Theorem  3.1.  Let  the  sequence  {X1,1'1 }  be  generated  by  a  member  of  the  SSI+  class.  Then  any  cluster  point 
of  the  sequence  is  a  critical  point  of  (2)  that  spans  a  non-trivial  invariant  subspace  of  AAT;  i.e., 

lim  Cc)  =  0, 

i— >-+oo 

We  note  that  the  sequence  is  bounded  by  construction,  hence  a  cluster  point  exists.  In  the  rest  of  this  sub¬ 
section,  we  will  develop  a  proof  for  Theorem  3.1  based  on  two  technical  lemmas,  which  show  monotonicity 
of  $(X)  during  iterations  and  establish  a  connection  to  the  critical  point  condition  (22). 
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Lemma  3.1.  Let  B  <G  Wn  x  rn  be  symmetric  positive  semidefinite  with  rank  r  >  0.  For  any  nonzero  vector 
x  €  R”\  there  holds 

(. xtB3x)(xtx )  —  ( xtB2x)(xtBx )  >  A r(B)  (( xtB2x)(xtx )  —  (xtBx)2)  ,  (26) 

where  Xr(B)  >  0  is  the  smallest  positive  eigenvalue  of  B.  Moreover,  whenever  xTx  =  1  and  Bx  f  0, 
inequality  (26)  implies 

-  %JBx  +  irWTT  ( xtB2x  -  ( xtBx )2)  .  (27) 

x1  Bzx  \\B\\2 

Proof.  Without  loss  of  generality,  we  assume  that  B  is  diagonal;  otherwise,  it  suffices  to  replace  x  by  Qt  x 
and  Bl  by  QT B(Q  for  I  =  1,2, 3,  where  the  columns  of  Q  consist  of  the  unit  eigenvectors  of  B.  We  now 
prove  (26)  for  a  diagonal  matrix  B.  By  expanding  the  terms  in  the  left-hand  side  of  (26),  we  obtain 

(xT  B3x)(xtx)  —  ( xTB2x)(xTBx ) 

m  m  m  m 

=  Y,  B?ix2i  YJ'r  Y  BiiXi  Y  BBXJ 

i=  1  j= 1  7=1  j= 1 

=  ^  af.rji/4  B2!!,,) 

l<7j<m 

=  Y  xix2j(Bii  ~  BiiBn)  +  Y  x2ix%Bii  ~  BiiBii) 

Y  x\x)(Bl  +  B:X  -  BlBjj  -  BuB2n) 

=  y  xix2j(Bu  +  Bjj)(Ba  -  Bn )2- 

In  the  above  derivation,  we  should  note  the  changes  in  the  range  of  summation. 

Similarly,  the  second  term  in  the  right-hand  side  of  (26)  can  be  expanded  into 

(xT  B2x)(xTx)  —  (x1  Bx)2  =  Y^  x2x2(Bu  —  Bjj)2. 

If  Brt  +  Bjj  =  0,  then  Bn  —  Bjj  =  0  due  to  the  nonnegativity  of  diag (B):  Otherwise, 

1 1 1 1  ~\~  Bjj  ■  '  max(Sjj,  Bjj)  f  X r(B). 

Combining  all  above  together,  we  deduce  that 

(xTB3x)(xTx)  —  (xTB2x)(xTBx)  >  X r{B)  Y  XiX](B™  ~  Bjj)2 

l<i<j<m 

which  proves  (26).  Finally,  when  xTx  =  1  and  Bx  f  0,  we  divide  the  left-hand  side  of  (26)  by  x  tB2x  and 
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the  right-hand  side  by  ||_B|||  >  xJB2x,  leading  to  (27).  □ 

Lemma  3.2.  Let  B  E  W"  x  rn  be  symmetric  positive  semidefinite  with  rank  r  >  0,  and  let  X  <G  Mmxfc  satisfy 
X T X  =  I  and  0  <  i  =  rank ( B X )  <  min(r,  Then,  for  any  Y  E  orthf  BX )  there  holds  that 

trac e(YTBY)  -  trac e(XTBX)  >  ||  (I  -  XXT)BX\\ 2 .  (28) 

PII2 

Proof  We  first  note  that  for  any  given  V) ,  Y2  G  orth(£?2f)  there  exists  an  orthogonal  matrix  U  E  irx^ 
so  that  Y\  =  Y2[7,  implying  trace) V)1/!!) )  =  trace (Y2T BYf).  Hence,  it  suffices  to  prove  (28)  for  a 
particular  Y  E  orth(HX). 

Let  BX  =  UDVt  be  the  economy-form  singular  value  decomposition  of  BX\  namely,  U  E  Mmx^ 
and  V  E  Mfcx/  have  orthonormal  columns  and  D  E  Vf  yJ:  is  diagonal  and  positive  definite.  We  select 
Y  =  U  E  orth(liX),  or  equivalently. 


Y  =  BXVD~l  =  BZD -1 

where  Z  =  XV  satisfies  ZT Z  =  I.  It  is  easy  to  see  that 

trace(YTHY)  =  trace((HZ.D”1)TH(JBZZr1))  =  trace(D~1ZTH3ZL»-1). 


Hence,  for  i  =  1,2, 


(YTBY)U  =  zjB3Zi/Dl, 
1  =  (YTY)U  =  zjB2zt/Dl 


where  Z{  is  the  7-th  column  of  Z.  It  follows  from  (30)  that 

l 

trace(YTHY)  = 

i= 1 


zjB3Zi 
zjB2Zi ' 


In  view  of  (31)  and  (27),  we  deduce 
trace(YT  BY)  > 

> 


X]  (  z^Bzi  +  T frn  ( zjB2*  -  ( zlBzi )2) 


i=  1 


trace(ZTHZ)  +  (P^lli  -  (^-S**)2) 


trace(ZTHZ)  + 


P... 

Ar(5) 

Pill 


|HZ||i-||ZTHZ||i) 


(29) 


(30) 


(31) 


Recall  that  Z  =  XV,  BX  =  UDVJ  and  VTV  =  I.  We  derive  that 

trac  e(ZJBZ)  =  trac  e(VT  (VDUT)XV)  =  trac  e{VDUTX)  =  trac  e(XTBX). 
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Similarly,  we  can  prove  that  ||.BZ||p  =  ||SX||p  and  \\Z[  BZ\\y,  =  \\X'1  BX\\y..  Therefore, 

trace(yTSy)  >  trace(ZTHZ)  +  {\\BZ\\l  -  \\ZTBZ\\%) 

=  trac e(XTBX)  +  {\\BX\\%  -  \\XTBX\\^) 

=  trace(XTBX)  +  ^I^-\\BX-X{XJBX)\\l 
\\BW2 

=  trace(XTBX)  +  ^^-\\{I-XX1')BX\\l, 

\\B\\2 

which  completes  the  proof.  □ 

Now  we  are  ready  to  present  a  proof  for  Theorem  3.1. 

Proof.  By  initialization,  =  ||ATAA°)||p  >  0.  Now  we  show  that  {<3>(ArW)}  js  nondecreasing, 

implying  that  the  sequence  is  bounded  away  from  the  zero  matrix.  After  each  iteration,  the  change  in 
4>(2T)  =  trace(XTzL4TAT)  is 

$(X(i+1))  -  4>(XW)  >  $(X(i+1))  -  (32) 

since  by  construction.  By  Lemma  3.2,  for  B  =  2L4T, 

$(AT(i+1))  -  4>(AW)  >  (/  -  XW(JW)1')  BX^  2  ,  (33) 

\\Bh  V  /  f 

where  Ar(/i)  >  0  is  the  smallest  positive  eigenvalue  of  B.  It  follows  from  (32)  and  (33)  that  the  sequence 

{4>(ArW)}  is  monotonically  non-decreasing.  It  is  clearly  positive  bounded  above  and  convergent,  hence  by 

virtue  of  (33)  forcing 

lirn  (i  -  iW(lW)T)  BX®  2  =  0.  (34) 

i-r+oo  V  /  F 

Consequently,  every  limit  point  is  a  critical  point  satisfying  (22).  The  conclusion  of  the  theorem  follows 
readily  by  standard  arguments.  □ 

3.4  Remarks  on  the  convergence  result 

Theorem  3. 1  is  applicable  to  a  broad  class  of  SSI  acceleration  schemes.  For  example,  any  method  producing 
an  approximate,  feasible  solution  to  the  subspace  optimization  problem  (5)  is  admissible  as  long  as  the 
function  4>(2T)  value  is  not  decrease.  The  theorem  is  established  under  the  absolutely  minimal  assumption; 
i.e.,  the  iterations  have  a  nontrivial  staid.  Even  in  the  case  of  SSI,  the  result  is  not  implied  by  existing  results 
since  a  gap  in  singular  values  is  not  necessary. 

As  perhaps  is  expected,  the  conclusion  of  Theorem  3.1  is  also  weak  in  the  sense  of  only  guaranteeing 
convergence  to  a  critical  point,  but  not  to  a  global  maximizer.  However,  in  the  context  of  solving  (5), 
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guaranteed  convergence  to  a  critical  point  is  not  as  weak  as  it  could  be  for  a  more  general  non-concave 
maximization  problem  since  (5)  does  not  have  any  local  non-global  maximizer,  as  has  been  long  known 
(and  can  be  shown  easily  from  second-order  optimality  conditions).  As  such,  it  seems  highly  unlikely 
for  a  subspace  maximization  scheme  like  LMSVD  to  be  attracted  to  any  saddle  point.  Indeed,  we  will  later 
present  numerical  evidence  to  demonstrate  that  LMSVD  almost  always  converges  to  optima  even  it  is  started 
extremely  close  to  a  non-optimal  saddle  point.  In  our  extensive  numerical  experiments,  LMSVD  behaves  as 
robustly  as  SSI,  if  not  more  so,  in  terms  of  converging  to  a  global  maximizer. 

Finally,  we  mention  that  if  we  add  the  assumption  that  the  sequence  i.e.,  coefficient  matrices 

associated  with  the  dominant  eigenvector  of  AAT,  remains  uniformly  nonsingular,  then  under  Assump¬ 
tion  3.1  we  could  derive  convergence  results  for  LMSVD  similar  to  those  in  Proposition  3.1  for  SSI. 


4  Practical  Issues 

Recall  that  the  subspace  <S>6)  is  constructed  from  the  current  iterate  and  p  previous  ones.  In  this  section,  we 
describe  a  strategy  for  selecting  the  memory  length  p  from  iteration  to  iteration.  We  also  discuss  the  choice 
for  the  number  of  guard  vectors  that  are  commonly  used  in  eigenvalue  solvers,  and  specify  the  stopping  rule 
used  in  LMSVD. 


4.1  Memory  Length 

The  memory  length  p,  used  for  constructing  the  subspace  <SW  in  (6),  is  a  crucial  parameter  to  the  perfor¬ 
mance  of  our  algorithm.  The  simplest  way  is  to  assign  a  constant  integer  value  pmax  to  p  at  every  iteration 
once  the  iteration  counter  i  reaches  pmax;  that  is,  at  iteration  i, 

p  =  min(i,pmax).  (35) 


In  general  a  larger  pmax  leads  to  a  smaller  number  of  iterations,  but  increasing  pmax  also  increases  the 
computational  costs  per  iteration.  Our  computational  experiments  indicate  that  usually  a  good  balance  is 
attained  for  pmax  €  {2,3,4}. 

We  have  also  found  that  an  adaptive  strategy  on  selecting  p  is  useful  to  improving  the  performance  of 
LMSVD.  As  the  iterate  sequence  converges,  the  neighboring  iterates  tend  to  become  more  and  more  linearly 
dependent.  Therefore,  once  judged  appropriate  it  is  beneficial  to  shrink  the  memory  by  deleting  a  block  from 
the  memory,  reducing  the  size  of  later  subspace  optimization  problems.  Specifically,  after  pmax  iterations, 
we  activate  the  following  adaptive  memory  size  strategy: 


p  = 


Nc(  R) 
k 


-1, 


(36) 


where  |~i~|  is  the  smallest  integer  greater  than  or  equal  to  t,  and  iVc(R)  is  the  number  of  columns  in  R  (see 
Line  7  of  Algorithm  LMSVD)  which  can  be  smaller  than  (p  +  1  )k  due  to  possible  deletions  done  in  the  two 
stabilization  steps.  Combining  (35)  and  (36),  we  reach  our  formula  for  selecting  the  memory  length  p  at  the 
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i-th  iteration: 


p  =  min 


Nc(  R) 

k 


I;  Pmax 


(37) 


which  is  nonnegative.  Generally,  p  initially  increases  to  reach  pmax,  then  becomes  non-increasing  with  a 
probability  to  decrease  to  a  smaller  value,  even  possibly  to  zero.  Of  course,  when  the  memory  length  p 
becomes  zero,  our  method  reduces  to  the  classic  SSI. 


4.2  Guard  Vectors 

It  is  well-known  that  when  the  SSI  method  is  applied  to  a  symmetric  positive  semidefinite  matrix  B  €  Mmxm 
with  to  x  k  block  iterates,  the  convergence  to  the  eigenspace  of  the  first  r  <  k  leading  eigenvectors  has  an 
asymptotic  rate 

■^fc+l/'V  —  ^k+l/^k 

where  A j,  j  =  1,  2,  ■  ■  ■  ,  to,  arc  the  eigenvalues  of  B  in  a  descending  order.  If  we  really  only  need  to 
compute  the  first  r  <  k  leading  eigen-pairs,  the  additional  k  —  r  vectors  arc  called  "guard  vectors”  and 
play  the  role  of  accelerating  convergence.  In  general,  the  more  guard  vectors  arc  used,  the  less  iterations  arc 
needed  for  convergence,  but  at  a  higher  cost  per  iteration  on  memory  and  computing  time.  Our  algorithm, 
built  on  the  basis  of  the  SSI  framework,  also  benefits  from  the  use  of  guard  vectors. 

Now  assume  that  the  r-th  dominant  SVD  of  ,4  G  Rmxn  (to  <  n)  is  to  be  computed.  We  implement  the 
following  standard  choice  for  k  >  r: 


k  =  min{2r,  r  +  10,  to}. 


(38) 


4.3  Stopping  Rule 

It  is  most  natural  to  monitor  convergence  by  observing  the  magnitude  of  the  residual  matrix 

Res  :=  (/  -  XXt)AAtX  =  AY  -  XYTY  (39) 


where  Y  =  ATX  which  is  computed  at  each  iteration,  preferably  in  a  relative  sense;  but  it  requires  some 
additional  matrix  multiplications.  To  reduce  unnecessary  overheads,  in  LMSVD  we  use  a  two-level  strategy 
widi  a  pair  of  stopping  criteria:  a  “low-cost”  one  and  a  “high-cost”  one.  We  examine  the  low-cost  criterion 
at  each  iteration,  and  activate  the  high-cost  one  only  when  the  low-cost  one  is  satisfied. 

The  low-cost  criterion  is  based  on  examining  the  r  leading  Ritz  values  of  AAT  in  the  subspace  Sl,'> 
that  are  available  after  the  subspace  optimization  problem  (10)  is  solved  at  each  iteration.  Specifically,  the 
criterion  is,  for  some  tolerance  et  >  0, 


A«  -  a,!*-1) 


<  Q 


AW 


(40) 


(i)  /  (i)\T  (i) 

where  A)  consists  of  the  r  leading  eigenvalues  of  (R}  )  Rp  with  R7,  given  by  (17).  This  criterion  not 
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only  can  be  observed  at  almost  no  additional  cost,  but  also  is  a  reasonably  accurate  indicator  of  convergence. 

Once  the  current  iteration  satisfies  the  low-cost  criterion  (40),  we  start  to  check  the  residual  in  (39)  in 
a  relative  sense.  To  avoid  extra  computation,  we  evaluate  the  residual  at  the  intermediate  variable 
Specifically,  the  high-cost  criterion  is,  for  some  tolerance  eh  >  0, 


Res(i  <  eft  j  =  l,...,r, 


(41) 


where  A  max  is  the  largest  eigenvalue  of  (Rp  j)TRp';,  e:l  is  the  j- th  column  of  the  identity  matrix,  and 


.(OYTtjW 


Res(i-1)  =  ay^-v  - 


(42) 


We  note  that  only  the  first  r  columns  of  Re.s('~ 1  )  need  to  be  evaluated,  and  the  SSI  method  already  does  the 
calculation  of  AY  =  AATX. 

To  relate  the  tolerance  ey  for  the  low-cost  criterion  to  ry,,  we  choose  the  formula  ep  =  ■Je.hZ  after 
some  numerical  experiments,  where  e  is  the  machine  epsilon  (in  double  precision,  e  =  2.2204  x  10-16). 
Empirically,  this  proves  to  be  a  well  balanced  choice  for  efficiency  and  reliability  in  our  two-level  strategy. 
The  number  of  iterations  in  which  the  high-cost  criterion  is  checked,  on  the  average,  is  about  10%  of  the 
total  number  of  iterations.  We  note  that  the  tolerance  for  the  high-cost  criterion,  6/,,  becomes  the  only  free 
parameter  in  our  stopping  rule.  For  simplicity,  in  the  sequel  we  will  just  refer  to  it  as  e,  which  will  be  varied 
in  some  numerical  experiments. 


5  Numerical  Experiments 

In  this  section,  we  demonstrate  the  effectiveness  of  the  LMSVD,  as  a  general  solver  for  computing  dominant 
SVDs  of  unstructured  matrices,  by  numerical  experiments  on  a  wide  variety  of  randomly  generated  matrices 
as  well  as  a  few  examples  from  applications.  Our  code  is  implemented  in  MATLAB.  All  the  experiments 
were  preformed  on  a  HP  laptop  with  Intel®  dual  core  i5-460M  CPU  at  2.53GHz  (x2)  and  8GB  of  memory 
running  Ubuntu  10.10  and  MATLAB  2010b. 

We  uses  r  to  denote  the  targeted  number  of  dominant  singular  values  and  vectors.  Considering  the  guard 
vector  discussed  in  section  4.2,  a  /c-dominant  SVD  problem  is  actually  solved,  in  which  k  is  determined  by 
(38)  where  the  default  value  of  the  parameter  £  =  10.  The  default  tolerance  value  is  e  =  10-8  in  the 
stopping  criterion  (41).  The  limited  memory  subspace  defined  in  (19)  is  always  used,  unless  specified 
otherwise.  The  maximal  blocks  of  previous  iterates  in  limited  memory  subspaces  is  set  to  pmax  =  3  in  (37). 
In  addition,  based  on  some  empirical  experiments,  we  set  the  tolerance  values,  e\  and  e>,  used  in  our  two- 
step  stabilization  scheme  (see  the  discussion  after  (15))  as  follows:  e\  =  5  x  10-8  and  e 2  =  min(e/j,  \fe), 
where  eh  is  the  tolerance  in  (41)  and  e  is  the  machine  precision  (in  double  precision,  e  =  2.2204  x  10~16). 
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5.1  Random  problem  generation 

In  this  section,  unless  otherwise  specified,  our  random  matrix  A  G  Wnxn,  assuming  rn  <  n  without  loss  of 
generality,  is  constructed  by  one  of  the  two  models  below  that  we  call  Model  1  (left)  and  Model  2  (right): 

A  :=  UDVt  or  A  :=  DR.  (43) 

In  Model  1,  the  matrices  U  G  Rmxm  and  V  G  Rnxm  are  first  generated  with  i.i.d.  standard  Gaussian 
entries  and  then  orthogonalized  using  (economy-size)  QR  decompositions,  while  D  G  R'mxm  is  a  nonneg¬ 
ative  diagonal  matrix.  This  construction  builds  the  singular  value  decomposition  of  A,  but  the  involved 
orthogonalization  can  become  costly  as  matrix  sizes  increase.  In  Model  2,  I)  G  Rmxm  is  still  diagonal,  and 
R  G  Rnxm  is  a  random  matrix  whose  entries  are  i.i.d.  standard  Gaussian.  Clearly,  the  construction  cost  of 
Model  2  is  much  lower. 

In  both  models,  the  diagonal  entries  of  D  are 

Du  :=  maxj/31”*,  e2},  Vi  =  1,  2,  (44) 

where  e  is  the  tolerance  used  in  the  stopping  criterion  (41).  The  parameter  ft  >  1  determines  the  decay  rate 
of  singular  values,  precisely  in  Model  1  and  approximately  in  Model  2.  Generally  speaking,  the  closer  ft 
is  to  1,  the  smaller  decay  there  is  in  singular  values  of  a  generated  matrix,  and  the  more  difficult  the  test 
instance  becomes. 

5.2  Sensitivity  with  respect  to  different  parameters 

In  this  subsection,  we  justify  via  numerical  evidence  the  default  choices  we  made  in  the  LMSVD  implemen¬ 
tation;  i.e.,  the  choice  of  a  subspace  construction  between  (6)  and  (19),  the  choice  of  the  number  of  guard 
vectors,  and  the  choice  of  pmax  -  the  maximum  memory  length.  For  this  purpose,  we  perform  three  sets 
of  tests  using  six  randomly  generated  matrices,  all  from  Model  1  where  the  decay  parameter  for  singular 
values  was  set  to  ft  =  1.01.  The  sizes  of  the  test  matrices,  (m,  n ),  and  the  number  of  computed  dominant 
singular  values,  r,  are  set  to  be  (rn.  n )  =  (2000, 4000)  with  r  =  40, 80, 120,  and  (m,  n)  =  (4000,  4000) 
with  r  =  80, 120, 160,  totaling  6  different  test  scenarios. 

We  first  compare  the  performance  of  LMSVD  using  subspaces  in  (6)  and  (19)  that  we  call  subspaces  1 
and  2,  respectively,  for  convenience.  The  comparison  is  based  on  three  quantities:  the  CPU  time  in  seconds, 
the  relative  error  between  the  computed  and  the  exact  r  dominant  singular  values,  say  E  and  E*  respectively, 
||E*  —  E|  |/|  | E* |  j,  and  the  total  number  of  matrix- vector  multiplications.  The  results  in  terms  of  the  above 
three  metrics  are  depicted  in  plots  (a)-(c)  of  Figure  1,  respectively.  From  these  plots,  we  can  see  that 
LMSVD  using  subspace  1  defined  in  (19)  required  fewer  number  of  matrix- vector  multiplications  than  that 
using  subspace  2  defined  in(6),  and  achieved  a  similar  solution  quality  within  a  similar  amount  of  CPU  time. 
Hence,  we  choose  to  use  (19)  in  all  subsequent  experiments. 

Our  next  test  is  to  study  the  behavior  of  LMSVD  with  respect  to  the  number  of  guard  vectors.  Specifi¬ 
cally,  we  run  LMSVD  with  £  in  (38)  going  over  the  set  {0,  5, 10,  20, 40,  r}.  The  corresponding  performance 
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Figure  1 :  Performance  of  LMSVD  using  two  different  subspaces 


(a)  CPU  time  (b)  Relative  error  (c)  Matrix-vector  multiplications 


of  LMSVD  is  illustrated  in  the  plots  (a)-(c)  of  Figure  2.  From  these  results,  it  appears  that  £  =  10  is  a  well 
balanced  choice,  and  is  set  as  subsequently  the  default  value. 

Figure  2:  Performance  of  LMSVD  with  respect  to  the  number  of  guard  vectors. 


(a)  CPU  time  (b)  Relative  error  (c)  Matrix-vector  multiplications 

We  now  consider  the  choice  of  pmax,  which  determines  the  longest  memory  (i.e.,  number  of  previous 
iterates)  used  in  LMSVD.  We  test  6  different  values:  pmax  =  0  to  5.  Note  that  pmax  =  0  gives  the  classic 
SSI  method.  The  performance  of  LMSVD  on  this  test  is  presented  in  plots  (a)-(c)  of  Figure  3,  from  which 
we  can  see  a  significant  performance  gap  between  pmax  =  0  and  pm ax  >  0.  A  slight  overall  improvement  is 
still  detectable  as  pmax  increases  from  1  to  3.  Afterward,  the  benefit  of  increasing  pmax  seems  to  diminish 
as  the  cost  of  subproblem  solving  outpaces  the  saving  from  a  reduced  iteration  number.  Consequently,  we 
choose  to  use  pmax  =  3  as  the  default  value  in  our  tests  (although  we  notice  that  when  the  ratio  k/r  is  close 
to  one,  a  slightly  bigger  pmax  may  yield  better  performance). 

It  is  well  known  that  besides  the  global  maximum  and  minimum,  all  other  critical  points  of  (2)  are  saddle 
points.  We  present  numerical  evidence  to  show  that  it  is  highly  unlikely  for  the  algorithm  to  be  attracted 
to  a  saddle  point  even  under  extreme  conditions.  In  this  set  of  experiments,  the  sizes  of  the  test  matrices 
are  (rn,  n,  r )  =  (2000, 4000, 40).  The  initial  guess  is  generated  near  a  saddle  point  Xs  of  (2)  whose 
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Figure  3:  Performance  of  LMSVD  with  respect  to  pmax. 


(a)  CPU  time 


(b)  Relative  error  (c)  Matrix-vector  multiplications 


columns  are  singular  vectors  randomly  selected  from  the  matrix  U  in  (43).  Specifically, 

X (°)  =  Xs  +  9  randn(m,  k) 

where  6  >  0  controls  the  distance  between  X and  Xs,  and  randn(m,  k )  is  the  Maltab  command  for 
generating  random  Gaussian  matrices  of  size  m  by  k.  Under  the  default  setting,  we  compare  the  behavior 
of  LMSVD  with  that  of  the  SSI  method  (i.e.,  LMSVD  with  pmax  =  0).  A  summary  of  numerical  results  is 
given  in  Table  5.2. 

We  observe  that  when  there  is  a  notable  decay  in  singular  values,  both  algorithms  could  move  away  from 
a  saddle  point  and  eventually  converge,  even  in  the  extreme  case  of  X1'11'1  =  Xs  (presumably  due  to  rounding 
eiTors).  It  becomes  more  difficult  to  escape  from  a  saddle  point  when  there  is  little  decay  in  singular  values. 
This  should  be  expected  since  in  this  case  the  objective  value  at  a  saddle  point  can  be  extremely  close  to  that 
at  the  maximum.  It  is  interesting  to  note  that  whenever  LMSVD  failed  to  escape,  so  did  the  SSI  method. 
Empirically,  LMSVD  appeal's  to  have  a  similar  convergence  behavior  as  SSL  if  not  a  better  one.  We  note 
that  SSI  has  been  proven  in  theory  to  possess  a  guaranteed  convergence  to  the  global  maximum. 

5.3  Performance  comparison  on  random  matrices 

We  now  compare  the  performance  of  LMSVD  with  several  state-of-the-art  SVD  solvers  including  the 
Matlab  built-in  function  EIGS  which  interfaces  with  the  Fortran  package  ARPACK  [14],  a  Matlab  ver¬ 
sion  of  LOBPCG  [1 1]  1  and  a  Matlab  version  of  the  solver  LANSVD  in  the  package  PROPACK  [12]  2.  In 
addition,  we  also  compare  with  SSI  without  acceleration  (i.e.,  pmax  =  0  in  LMSVD)  and  with  Chebyshev 
acceleration,  called  SSI+C,  which  is  SSI  augmented  by  extra  steps  introduced  in  section  1  of  chapter  6  in 
[25].  Instead  of  the  dedicated  Matlab  built-in  interface  SVDS  for  singular  value  decomposition,  we  choose 
to  directly  use  the  function  EIGS  since  our  numerical  experiments  indicate  that  the  performance  of  the  for¬ 
mer  is  often  significantly  inferior  to  that  of  the  latter  in  many  instances.  We  also  mention  that  the  Matlab 
version  of  LANSVD  performs  re-orthogonalization  calculations  by  calling  a  Fortran  subroutine  via  Matlab’s 

'Downloadable  from  http :  //www  .mathworks  .  com/mat  labcentral/f  ileexchange/ 4  8. 

2Downloadable  from  http  :  //soi  .  Stanford .  edu/  ~  rmunk/ PROPACK. 
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Table  1 :  An  illustration  of  global  convergence  to  the  solution  set 


Test  Settings  j 

j  LMSVD  | 

SSI 

P 

e 

#iter. 

error 

#iter. 

error 

1.1 

l.e-08 

3 

5.3548e-15 

6 

8.8056e-15 

1.1 

l.e-09 

3 

2.8147e-15 

6 

1.4222e-14 

1.1 

l.e-10 

3 

5.1257e-15 

7 

4.4492e-15 

1.1 

l.e-11 

3 

2.6859e-15 

6 

6.2115e-15 

1.1 

0 

3 

2.5680e-15 

7 

3.5907e-14 

1.01 

l.e-08 

8 

4.9308e-14 

56 

1.0840e-12 

1.01 

l.e-09 

7 

8.1987e-14 

52 

9.4457e-13 

1.01 

l.e-10 

7 

7.1 1 16e-14 

53 

9.3774e-13 

1.01 

l.e-11 

8 

2.4877e-14 

54 

1.0526e-12 

1.01 

0 

9 

5.5088e-14 

62 

9.8569e-13 

1.001 

l.e-08 

42 

2.1386e-12 

150 

1.0049e-05 

1.001 

l.e-09 

44 

1.4742e-12 

150 

1.7350e-05 

1.001 

l.e-10 

2 

3.8007e+00 

2 

3.8007e+00 

1.001 

0 

2 

3.8007e+00 

2 

3.8007e+00 

MEX  external  interface. 

Our  experiments  are  performed  on  a  large  set  of  random  matrices  generated  by  the  two  models  described 
in  Section  5.1.  The  compared  quantities  arc  CPU  time,  the  number  of  matrix- vector  multiplications  and 
the  solution  quality.  For  LMSVD,  each  evaluation  of  the  product  AATX  is  counted  as  2k  matrix-vector 
multiplications  where  k  is  the  number  of  columns  in  X  defined  by  (38).  Solution  quality  is  measured  by  the 
relative  error  (in  2-norm)  of  the  r  computed  dominant  singular  values  against  their  most  accurate  available 
values.  For  matrices  generated  by  Model  1,  the  exact  singular  values  are  known  to  begin  with.  For  matrices 
generated  by  Model  2  where  exact  solutions  arc  not  known,  relative  errors  arc  calculated  with  respect  to  the 
solutions  of  EIGS  for  all  algorithms  other  than  EIGS. 

In  all  the  experiments  throughout  this  section,  EIGS  and  LOBPCG  are  invoked  with  their  default  param¬ 
eters.  For  LANSVD,  however,  we  have  observed  that  its  performance  in  large-decay  cases  can  be  improved 
if  its  parameter  “lanmax”  is  set  to  2 r,  while  the  default  value  is  more  appropriate  in  small-decay  cases. 
Therefore,  in  order  to  obtain  better  results  from  LANSVD,  we  use  the  default  value  for  the  parameter  “lan¬ 
max”  if  /3  <  1.1  and  the  value  2 r  otherwise. 

The  first  experiment  is  to  evaluate  the  algorithms  with  different  stopping  tolerance  values,  with  the  goal 
of  finding  a  proper  tolerance  value  to  be  used  in  later  experiments.  Generally  speaking,  when  considering  a 
group  of  algorithms,  the  relationship  between  tolerance  and  solution  quality  can  be  complicated  due  to  the 
existence  of  multiple  factors  affecting  individual  algorithms.  For  this  test,  we  construct  four  examples,  using 
Model  1  so  that  the  exact  singular  values  are  known,  with  size  parameters  (m,n,r,/3)  set  to,  respectively, 
(2000,4000,40,1.01),  (2000,4000,80,1.01),  (2000,4000,40,1.1)  and  (4000,4000,40,1.01).  We  will 
call  them  test  cases  1  to  4.  The  results  on  relative  error  are  presented  in  Figure  4,  where  in  each  plot  the 
horizontal  axis  represents  tolerance  value  varying  from  10~4  to  10~12. 

We  observe  from  Figure  4  that  the  accuracy  of  EIGS  is  insensitive  to  tolerance  value  in  the  testes  cases, 
always  achieving  the  machine  precision  even  with  the  largest  tolerance  value  10“  4.  Although  other  three 
algorithms  demonstrate  differently  sensitivity,  they  all  achieve  a  precision  of  order  10-15  with  tolerance 
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Figure  4:  Relationship  between  tolerance  and  relative  error. 


(a)  test  case  1  (small  decay)  (b)  test  case  2  (small  decay) 


(c)  test  case  3  (large  decay)  (d)  test  case  4  (small  decay) 


value  1CT10.  Therefore,  we  will  use  10-10  as  the  default  tolerance  value  for  LMSVD,  SSI+C,  LANSVD 
and  LOBPCG  in  the  rest  of  the  paper. 

In  addition,  since  SSI+C  clearly  outperforms  SSI  in  all  tested  instances,  not  only  in  terms  of  accuracy  as 
shown  in  Figure  4,  but  also  in  terms  of  CPU  time  and  the  number  of  matrix-vector  multiplications  that  we 
are  showing  here,  we  will  exclude  SSI  from  further  comparisons  presented  hereinafter. 

We  next  evaluate  the  remaining  five  solvers  in  the  following  three  types  of  randomly  generated  scenarios 
constructed  by  Model  2: 

Type  I.  For  each  m  £  {1000,  2000, . . . ,  6000},  set  m  =  n,  the  decay  rate  parameter,  see  (44),  f3  =  1.01 
and  the  number  of  computed  singular  values  r  =  0.03m.  The  experiment  is  repeated  for  /3  =  1.1. 

Type  II.  Set  ( m,n )  =  (2000,4000),  /3  =  1.01,  and  vary  r  from  0.01m  to  0.06m.  The  experiment  is 
repeated  for  f3  =  1.1. 

Type  III.  Set  (m,  n)  =  (2000, 4000),  r  =  60,  and  let  f3  =  1.01  +  0.03i  for  i  =  0, 1,  •  •  •  ,5. 

Comparison  results  from  the  above  three  groups  of  tests  are  given  in  Figures  5,  6  and  7,  respectively. 
The  following  observations  should  be  clear. 

•  All  solvers  achieved  a  good  accuracy  of  10“ 12  except  for  one  case  where  LOBPCG  failed  to  do  so. 

•  The  two  Lanczos-based  solvers  EIGS  and  LANSVD  generally  requires  less  matrix-vector  multipli¬ 
cations  than  the  three  subspace-iteration  based  solvers  LMSVD,  SSI+C  and  LOBPCG,  though  this 
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Figure  5:  Comparison  with  varying  matrix  dimension  (Type  I) 


(a)  CPU  time:  /?  =  1.01 


(b)  relative  error:  /3  =  1.01 


(c)  Matrix-vector  multiplications 


(d)  CPU  time:  fi  =  1.1  (e)  relative  error:  /3  =  1.1  (f)  Matrix- vector  multiplications 


advantage  does  not  necessarily  mean  a  higher  efficiency  in  CPU  time. 

•  Relatively  speaking,  small-decay  cases  are  more  favorable  to  the  Lanczos-based  solvers,  while  large- 
decay  cases  more  favorable  to  the  subspace-iteration  based  solvers. 

In  terms  of  the  performance  of  LMS  VD,  we  can  safely  make  the  following  observations. 

•  On  large-decay  matrices,  LMSVD  takes  as  few  iterations  as  SSI+C  (though  iteration  number  is  not 
presented  here),  while  on  small-decay  matrices  when  SSI+C  and  LOBPCG  become  really  slow, 
LMSVD  remains  competitive  with  EIGS  and  LANSVD. 

•  In  all  the  test  instances  of  small  or  large  decay,  the  time  efficiency  of  LMSVD  is  always  close  to  the 
best,  if  not  the  best  by  itself,  among  the  five  tested  solvers. 

As  is  noted,  when  one  matrix-block  multiplication  AAJX  is  counted  as  k  matrix-vector  multiplications, 
the  two  Lanczos-iteration  based  solvers,  EIGS  and  LANSVD,  require  considerably  fewer  matrix-vector 
multiplications.  Even  so,  LMSVD  can  still  be  be  faster  in  terms  of  CPU  time.  Presumably,  this  has  to  do  with 
memory  access  patterns  in  a  memory  hierarchy  of  modern  computer  architecture  which  is  in  favor  of  doing 
one  matrix-block  multiplication  at  once  than  doing  k  individual  matrix- vector  multiplications  sequentially. 
In  addition,  block  matrix  multiplications  facilitate  the  use  of  high  efficiency  level  3  BLAS  subroutines  in  the 
linear  algebra  package  LAPACK  [1]  used  by  Matlab. 

An  known  advantage  of  subspace-iteration  based  algorithms  over  Lanczos  based  ones  is  that  the  former 
can  easily  benefit  from  a  good  warm  starts,  whenever  available.  To  demonstrate  this,  we  construct  a  sequence 
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Figure  6:  Comparison  with  varying  number  of  computed  singular  values  (Type  II) 


matrix  size:  2000  x  4000;  p  =  1.01 


Number  of  dominant  singular  values  computed,  r 


Number  of  dominant  singular  values  computed,  r 


(a)  CPU  time:  ^  =  1.01 


(b)  Relative  error:  0  =  1.01 


(c)  Matrix-vector  multiplications 


matrix  size:  2000  x  4000;  p  =  1 .1 


Number  of  dominant  singular  values  computed,  r 


Number  of  dominant  singular  values  computed,  r 


Number  of  dominant  singular  values  computed,  r 


(d)  CPU  time:  0  =  1.1 


(e)  Relative  error:  0  m  1.1 


(f)  Matrix-vector  multiplications 


of  2000  by  4000  matrices,  starting  from  a  random  matrix  A ^  generated  with  3  =  1.01  and  each  subsequent 
matrix  taking  the  form  of  A~>+1>  =  A^  +  57TT  ||vkCO||  ’  f°r  3  =  1,  ■  ■  •  ,14,  where  is  a  random  matrix 
with  i.i.d.  standard  Gaussian  elements.  For  the  three  subspace-iteration  based  algorithms,  at  each  step,  we 
set  the  initial  guess  as  the  output  of  the  previous  step.  The  results  are  plotted  in  Figure  8.  As  we  can  see,  as 
the  sequence  ’’converges”,  the  cost  of  solving  each  problem  generally  decreases  for  the  subspace-iteration 
based  algorithms,  while  it  remains  flat  for  EIGS  and  LANSVD. 

Finally,  we  present  an  overall  evaluation  of  the  five  solvers  using  performance  profiles  introduced  by 
More  and  Dolan  [5].  These  profiles  provide  a  way  to  graphically  compare  a  performance  quantity,  say  tPjS 
representing  the  number  of  iterations  or  CPU  time  required  by  solver  s  to  solve  problem  p  for  a  group  of 
solvers  on  a  set  of  test  problems. 

Let  us  define  rPjS  to  be  the  ratio  between  the  quantity  tPjS,  obtained  on  problem  p  by  solver  s,  over  the 
lowest  such  quantity  obtained  by  any  of  the  solvers  on  problem  p,  i.e.,  rPyS  :=  iP)S/mins{iP)S}.  Whenever 
solver  s  fails  to  solve  problem  p,  the  ratio  rPtS  is  set  to  infinity  or  some  sufficiently  large  number.  Then,  for 
r  >  0,  the  ratio 

number  of  problems  where  rp  s  <  r 

\'Y)  _ — - 

total  number  of  problems 

is  the  fraction  of  the  test  problems  that  were  solved  by  solver  s  within  a  factor  r  >  1  of  the  performance 
obtained  by  the  best  solver.  The  performance  plots  present  vrs(r)  for  each  solver  s  as  a  function  of  r.  The 
curves  are  always  monotonically  nondecreasing,  and  the  closer  a  curve  to  the  unity,  the  better. 

In  this  experiment,  540  test  matrices  are  generated  using  Model  2.  The  sizes  of  the  test  matrices  are 
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Figure  7:  Comparison  with  varying  decay  parameter  /3  (Type  III) 


matrix  size:  2000  x  4000:  r  =  60 


Dominant  singular  values  decaying  parameter,  p 


(a)  CPU  time  in  seconds 


(b)  Relative  error 


(c)  Matrix-vector  multiplications 


Figure  8:  warm  start. 


(a)  CPU  time  in  seconds 


(b)  Relative  error 


(c )  Matrix- vector  multiplications 


(m,  n)  where  m  ranges  from  2000  to  6000  increment  1000,  and  n  ranges  from  m  to  6000  increment  1000 
as  well.  Hence,  there  are  15  types  of  matrix  shapes.  For  each  of  these  matrices,  we  let  r  range  from  0.01m 
to  0.0(jm  increment  0.01m,  and  (3  range  from  1.01  to  1.15  increment  0.03.  In  total,  this  procedure  generates 
15*6*6  =  540  matrices. 

Performance  profiles  on  CPU  time  and  the  number  of  matrix-vector  multiplications  are  given  in  Figure 
9.  Plot  (a)  on  the  left  show  that,  on  the  540  tested  problems,  (i)  LMSVD  is  the  best  performer  in  terms  of 
CPU  time,  and  (ii)  it  always  solved  problems  in  no  more  than  twice  of  the  fastest  time  among  the  five  solvers. 
On  the  other  hand,  plot  (b)  shows  that  both  EIGS  and  LANSVD  require  fewer  matrix-vector  multiplications 
than  LMSVD  does,  while  LMSVD  uses  fewer  matrix-vector  multiplications  than  the  other  two  subspace- 
iteration  based  algorithms  SSI+C  and  LOBPCG. 

In  addition,  the  average  relative  error  with  respect  to  the  solutions  obtained  by  EIGS  over  the  540  test 
problems  are  reported  in  Table  2. 

5.4  Performance  comparison  on  an  application  problem 

In  this  subsection,  we  test  the  performance  of  LMSVD  on  problems  from  robust  principal  component  pur¬ 
suit,  which  is  a  recent  technique  with  potential  applications  in  image  and  video  analysis  and  other  applica- 
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Figure  9:  Performance  profile 


Performance  profile  of  540  matrices  on  CPU  time  Performance  profile  of  540  matrices  on  matrix-vector  multiplications 


(a)  CPU  time 


(b)  the  number  of  matrix- vector  multiplications 


Table  2:  Average  relative  error  with  respect  to  EIGS 


LMSVD 

SSI+C 

LANSVD 

LOBPCG 

6.5675e-015 

9.3449e-015 

9.7473e-015 

6.5973e-014 

tions  where  data  is  either  incomplete  or  corrupted.  Specifically,  the  problem  is  to  separate  a  low-rank  matrix 
Lo  and  a  sparse  matrix  ,S’o  from  their  given  sum  D  =  To  +  So  €  Mmxn.  It  has  been  shown  in  [3]  that,  under 
some  suitable  conditions,  its  solution  can  be  found  by  solving  the  convex  optimization  problem: 

min  ||L||*  + /r||S||i  s.t.  L  +  S  =  D,  (45) 

L,SeRmxn 

where  fi>  0  is  a  proper  weighting  factor,  ||L||*  is  the  nuclear  norm  of  L  (the  sum  of  its  singular  values)  and 
115111  is  the  sum  of  the  absolute  values  of  all  entries  of  S  (not  the  matrix  1-norm). 

There  are  a  number  of  methods  proposed  for  solving  (45),  such  as  the  exact  and  inexact  augmented 
Lagrange  multiplier  (ALM  and  IALM)  algorithms  proposed  in  [15]  and  [29],  respectively.  However,  no 
matter  what  method  is  used  to  solve  (45),  the  main  cost  is  the  computation  of  the  dominant  SVD  of  certain 
matrices  related  to  the  nuclear  norm  term.  Here  we  study  how  the  performance  of  the  IALM  solver  changes 
as  its  default  SVD  solver  LANSVD  is  replaced  by  another  SVD  solver,  in  particular  by  LMSVD. 

We  first  test  the  algorithms  on  matrices  of  the  form  D  =  Lo  +  So,  where  Lo  is  rank -r  and  So  is 
sparse.  Specifically,  Lo  =  XYJ ,  where  X  and  Y  are  n  x  r  i.i.d.  standard  Gaussian  random  matrices; 
So  =  tS,  where  S  is  a  sparse  matrix  whose  nonzero  positions  are  uniformly  sampled  and  nonzero  element 
values  are  independently  chosen  from  the  standard  Gaussian  distribution  with  variance  1/n,  and  r  is  a 
scalar  which  makes  Sq  roughly  the  same  magnitudes  as  Lq.  The  sizes  of  the  test  matrices  are  m  =  n  = 
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500, 1000, 1500, ...,  4000.  We  test  two  groups  of  test  problems.  In  each  of  them,  the  rank  ratio  parameter 
Tr  =  r/n  and  the  density  parameter  rs  =  nnz(S'o)/n2  are  set  to  (0.05,  0.10)  and  (0.10,  0.05),  respectively. 
In  all  cases,  the  balancing  parameter  fi  is  fixed  as  1/n. 

We  run  IALM  with  four  SVD  solvers,  LMSVD,  LANSVD,  EIGS  and  LOBPCG  on  the  first  group  of 
test  problems,  and  with  the  first  three  solvers  on  the  second  group,  excluding  LOBPCG  due  to  a  runtime 
eiTor  message  it  gives.  Since  all  solvers  achieved  a  very  similar  accuracy,  we  only  summarize  the  CPU  time 
results  in  Figure  10.  We  observe  that  using  LMSVD  in  IALM  reduces  the  CPU  time  by  30-50%  comparing 
to  using  LANSVD.  The  performance  of  IALM  using  EIGS  is  similar  to  that  of  using  LANSVD,  while  the 
results  of  using  LOBPCG  are  not  competitive. 


Figure  10:  CPU  time  results  on  randomly  generated  matrix  separation  problems. 


(a)  CPU  time  (group  1 )  (b)  CPU  time  (group  2) 


We  next  consider  the  case  of  recovering  an  image  (as  an  approximate  low-rank  matrix)  from  its  sum 
with  a  sparse  random  matrix.  We  test  two  images:  a  rank-2  image  “checkerboard”  of  size  512  x  512 
and  an  approximately  low-rank  image  “brickwall”  of  size  772  x  1165,  shown  in  Figures  11  (a)  and  (b), 
respectively.  The  images  are  corrupted  by  adding  a  sparse  matrix  whose  nonzero  entries  take  random  values 
uniformly  distributed  in  [0, 1]  and  the  locations  of  the  nonzero  entries  are  uniformly  random.  The  density 
level  nnz ( S* ) /m2  was  set  to  5%.  On  each  these  problems,  IALM  calls  a  SVD  solver  about  18  times.  Table 
3  reports  the  average  number  of  singular  values,  denoted  by  “av.dsv”,  computed  at  each  iteration,  the  relative 
error,  denoted  by  “rel.err”,  of  the  recovered  image,  and  the  CPU  time  used.  The  recovered  images  by  IALM 
with  LMSVD  are  depicted  in  Figures  1 1  (a)  and  (b).  Since  the  recovered  images  with  other  SVD  solvers  are 
almost  visually  identical,  they  are  not  included  here  for  the  sake  of  space. 


Problem  Name 

IALM+LMSVD 

IALM+LANSVD 

IALM+LOBPCG 

IALM+EIGS 

av.dsv 

rel.err 

CPU 

av.dsv 

rel.err 

CPU 

av.dsv 

rel.err 

CPU 

av.dsv 

rel.err 

CPU 

checkerboard 

59.0 

0.23 

2.95 

59.0 

0.23 

7.91 

59.0 

0.23 

7.09 

59.0 

0.23 

3.42 

brickwall 

76.8 

0.29 

10.78 

76.8 

0.29 

23.55 

76.8 

0.29 

22.43 

76.8 

0.29 

17.05 

video-hall 

56.4 

- 

23.74 

56.4 

- 

39.72 

56.4 

- 

115.68 

56.4 

- 

53.64 

Table  3:  image  restoration  and  video  separation 


Finally,  with  different  SVD  solvers  we  apply  IALM  to  a  video  separation  problem  in  [3],  which  aims 
to  separate  a  video  into  a  static  background  and  moving  objects.  Every  frame  of  a  video  is  reshaped  into  a 
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long  column  vector  and  then  collected  into  a  matrix.  As  such,  the  number  of  rows  in  the  matrix  equals  to 
the  number  of  pixels  and  the  number  of  columns  equals  to  the  number  of  frames  in  a  video.  We  take  only 
part  of  the  clip  with  500  frames  to  reduce  the  storage  and  computation  required,  resulting  in  a  25344  x  500 
matrix  separation  problem.  In  this  test,  IALM  needs  to  call  a  SVD  solver  7  times.  The  average  number  of 
dominant  SVD  computed  at  each  iteration  and  the  CPU  time  arc  again  included  in  Table  3  as  the  last  row. 
We  observe  that  IALM  with  LMSVD  requires  the  least  amount  of  CPU  time  than  with  other  solvers.  A  few 
recovered  frames  by  IALM  with  LMSVD  arc  given  in  Figure  1 1. 

6  Conclusions 

We  propose  and  study  a  limited  memory  subspace  optimization  technique  to  accelerate  the  classical  simul¬ 
taneous  subspace  iteration  (SSI)  method  for  computing  truncated  singular  value  decompositions  (SVDs)  of 
large  and  unstructured  matrices.  The  main  idea  is  to  introduce  an  intermediate  iterate  which  maximizes 
the  Rayleigh-Ritz  function  in  a  subspace  spanned  by  a  few  previous  iterates  without  introducing  any  ex¬ 
tra  matrix-block  multiplication.  A  stable  algorithm  is  constructed  to  produce  high  precision  solutions  if  so 
specified,  and  a  general  convergence  result  is  established  based  on  minimal  assumptions. 

Comprehensive  numerical  experiments  are  conducted  to  evaluate  the  performance  of  our  new  truncated 
SVD  solver  LMSVD  in  comparison  to  a  few  state-of-the-art  solvers.  Numerical  results  indicate  that  LMSVD 
does  accelerate  SSI  (or  SSI  plus  Chebyshev  acceleration)  significantly  to  a  point  where  its  performance 
becomes  competitive,  ofter  superior,  to  the  best  publicly  available  solvers  on  a  wide  range  of  unstructured 
matrices  under  the  MATLAB  environment.  We  hope  that  LMSVD  will  become  a  new  addition  to  the 
computational  toolbox  useful  in  data-intensive  applications. 

Finally,  we  opine  that  on  unstructured  matrices  an  SSI-based  solver  like  LMSVD,  rich  in  level  3  BLAS 
operations,  offers  a  greater  opportunity  to  be  efficiently  parallelized  on  massively  parallel  computers  than 
that  offered  by  solvers  based  on  Lanczos  iterations. 
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Figure  11:  Image  restoration  and  video  separation  problems. 
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